Google Play Store Apps

The dataset used was taken from: https://www.kaggle.com/lava18/google-play-store-apps#license.txt

In [1]:
# Core libraries
import numpy as np
import pandas as pd
import scipy

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


# Machine learning
# split data into training and test sets
from sklearn.model_selection import train_test_split

# decision tree
from sklearn import tree

# random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc

from sklearn.tree import export_graphviz
from subprocess import call
In [2]:
from os import path
from PIL import Image
In [3]:
pip install wordcloud
Requirement already satisfied: wordcloud in c:\users\rfuen\anaconda3\lib\site-packages (1.5.0)
Requirement already satisfied: numpy>=1.6.1 in c:\users\rfuen\anaconda3\lib\site-packages (from wordcloud) (1.16.2)
Requirement already satisfied: pillow in c:\users\rfuen\anaconda3\lib\site-packages (from wordcloud) (5.4.1)
Note: you may need to restart the kernel to use updated packages.
In [4]:
df=pd.read_csv('./googleplaystore.csv')

Data quality analysis

In [5]:
df.head(5)
Out[5]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [6]:
rowsiniciales=df.shape[0]
rowsiniciales #10841
#df=df.dropna()
#rowsfinales=df.shape[0]
#rowsfinales
Out[6]:
10841
In [7]:
b=df.nunique()
b
Out[7]:
App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64

Notice that some apps are repeated. Two things can be done here:

  • Check how the rating or other variables vary for the same app (this requires looking at the date and version). It may also be that the rows are simply identical, that is, plain duplicates rather than snapshots taken on different dates. Here the repeated apps will simply not be considered (see below).
  • Remove repeated apps so that each app is counted only once. This makes the analysis more reliable, since every app then carries the same weight in the final results.
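Before choosing between the two options, it may help to check whether the repeated rows are exact copies or actually differ. The following is a minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the Play Store table (not real data).
toy = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C"],
    "Rating": [4.1, 4.1, 3.9, 4.0, 4.5],
    "Reviews": [100, 100, 50, 60, 10],
})

# Every row whose App name appears more than once.
repeated = toy[toy.duplicated(subset="App", keep=False)]

# Rows that are identical in all columns to an earlier row.
exact_copies = int(toy.duplicated(keep="first").sum())

# Apps whose repeated rows differ in at least one column.
distinct_rows_per_app = repeated.drop_duplicates().groupby("App").size()
apps_with_differences = distinct_rows_per_app[distinct_rows_per_app > 1].index.tolist()

print(len(repeated), exact_copies, apps_with_differences)
```

If most repeated rows turn out to be exact copies, dropping them by app name loses little information; if many differ, the choice of which row to keep matters more.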
In [8]:
df[df.isnull().any(axis=1)].count()
Out[8]:
App               1481
Category          1481
Rating               7
Reviews           1481
Size              1481
Installs          1481
Type              1480
Price             1481
Content Rating    1480
Genres            1481
Last Updated      1481
Current Ver       1473
Android Ver       1478
dtype: int64

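Several columns have missing values. Of the 1,481 rows with at least one missing value, only 7 have a Rating, so Rating accounts for most of the gaps.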
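The count above is taken over rows that contain at least one null, which makes it a little indirect to read. A more direct view, sketched here on a hypothetical `toy` frame, counts the missing values in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Play Store table (not real data).
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.0],
    "Type": ["Free", "Paid", None, "Free"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows with at least one missing value.
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(rows_with_any_missing)
```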

In [9]:
a=df.dtypes
a
Out[9]:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

Notice that only Rating is stored as a number. We would like Installs, Price, Reviews and Size to be numeric as well (this is essential for plotting).
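As a compact alternative to converting each column step by step, the sketch below strips the formatting characters and lets `pd.to_numeric` turn anything unparseable into NaN. The `to_number` helper and the `toy` values are assumptions made for illustration only:

```python
import pandas as pd

# Hypothetical raw values, including one malformed row.
toy = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
    "Reviews": ["159", "967", "3.0M"],
})

def to_number(series, strip_chars):
    """Remove each character in strip_chars, then coerce the rest to numbers."""
    cleaned = series.astype(str)
    for ch in strip_chars:
        cleaned = cleaned.str.replace(ch, "", regex=False)
    # errors="coerce" maps values that still cannot be parsed to NaN.
    return pd.to_numeric(cleaned, errors="coerce")

toy["Installs"] = to_number(toy["Installs"], "+,")
toy["Price"] = to_number(toy["Price"], "$")
toy["Reviews"] = to_number(toy["Reviews"], "")

print(toy)
```

With this approach malformed rows surface as NaN and can be dropped in one place, instead of being discovered one conversion error at a time.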

Data Cleaning

In [10]:
df.drop_duplicates(subset='App', inplace=True)
In [11]:
df.nunique()
Out[11]:
App               9660
Category            34
Rating              40
Reviews           5331
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             119
Last Updated      1378
Current Ver       2818
Android Ver         33
dtype: int64
In [12]:
df.shape[0]
Out[12]:
9660

Note that we now have the same number of rows as distinct apps!

In [13]:
df['Installs']=df.Installs.apply(lambda x: x.replace('+','') if '+' in str(x) else x)
In [14]:
#df.head(5)  # this looks right (the + is gone); now convert it to a number
In [15]:
#df['Installs']=df.Installs.apply(lambda x: int(x))  # raises an error because of the commas (they must be removed)
In [16]:
df['Installs']=df.Installs.apply(lambda x: x.replace(',','') if ',' in str(x) else x)
In [17]:
#df['Installs']=df.Installs.apply(lambda x: int(x))  # raises another error because some rows contain the value 'Free'
In [18]:
df=df[df['Installs']!='Free']
In [19]:
df['Installs']=df.Installs.apply(lambda x: int(x))
In [20]:
df.head(4)
Out[20]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10000 Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500000 Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5000000 Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50000000 Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
In [21]:
#df.dtypes  # Installs is done

Now the same for Price


In [22]:
#df['Price']=df.Price.apply(lambda x: float(x))  # the $ signs get in the way
In [23]:
df['Price']=df.Price.apply(lambda x: x.replace('$','') if '$' in str(x) else x)
In [24]:
df['Price']=df.Price.apply(lambda x: float(x))  # done
In [25]:
df['Reviews']=df.Reviews.apply(lambda x: int(x))  # done
In [26]:
df['Size']=df.Size.apply(lambda x: x.replace('M','') if 'M' in str(x) else x)
In [27]:
#df['Size']=df.Size.apply(lambda x: float(x))  # fails on 'Varies with device'
# as an aside (this variable may be useful), we can check whether it correlates positively with the rating
# intuitively, if the size is allowed to vary by device, the app should work better on every kind of device
# this is only a theory
In [28]:
df[df['Size']=='Varies with device'].Rating.mean()
Out[28]:
4.249101796407182
In [29]:
df[df['Size']!='Varies with device'].Rating.mean()    
Out[29]:
4.160623310089655

The intuition holds: apps whose size varies with the device average a slightly higher rating (4.25 versus 4.16). The difference is small, however, so we will drop the rows with 'Varies with device' in order to treat this variable as a number.
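The Size conversion can also be written as one small function. This is only a sketch: `parse_size_mb` is a hypothetical helper, and it uses the same factor of 1000 for kilobytes that the notebook applies.

```python
import math

def parse_size_mb(text):
    """Convert a size string such as '19M' or '201k' to megabytes.

    Anything else (for example 'Varies with device') becomes NaN.
    """
    text = str(text).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1000  # kilobytes to megabytes
    return float("nan")

sizes = [parse_size_mb(s) for s in ["19M", "8.7M", "201k", "Varies with device"]]
print(sizes)
```

Applied with `df['Size'].apply(parse_size_mb)`, this would leave real NaN values in the column, so a single `dropna()` would be enough afterwards.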

In [30]:
df['Size']=df.Size.apply(lambda x: x.replace('Varies with device','NaN') if 'Varies with device' in str(x) else x)
In [31]:
#df['Size']=df.Size.apply(lambda x: float(x))  # note that some apps are sized in kilobytes
In [32]:
df['Size'] = df.Size.apply(lambda x: float(x.replace('k', '')) / 1000 if 'k' in str(x) else x)
In [33]:
del df['Current Ver']  # the app version (this variable is dropped)
In [34]:
df.head(4)
Out[34]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19 10000 Free 0.0 Everyone Art & Design January 7, 2018 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14 500000 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000 Free 0.0 Everyone Art & Design August 1, 2018 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25 50000000 Free 0.0 Teen Art & Design June 8, 2018 4.2 and up
In [35]:
df[df.isnull().any(axis=1)].count()
Out[35]:
App               1465
Category          1465
Rating               2
Reviews           1465
Size              1465
Installs          1465
Type              1464
Price             1465
Content Rating    1465
Genres            1465
Last Updated      1465
Android Ver       1463
dtype: int64
In [36]:
df=df.dropna()
In [37]:
df.nunique()
Out[37]:
App               8194
Category            33
Rating              39
Reviews           5321
Size               414
Installs            19
Type                 2
Price               73
Content Rating       6
Genres             114
Last Updated      1300
Android Ver         31
dtype: int64
In [38]:
df['Size'] = df.Size.apply(lambda x: float(x))
In [39]:
df.dtypes
Out[39]:
App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Android Ver        object
dtype: object
In [40]:
df[df.isnull().any(axis=1)].count()
Out[40]:
App               1169
Category          1169
Rating            1169
Reviews           1169
Size                 0
Installs          1169
Type              1169
Price             1169
Content Rating    1169
Genres            1169
Last Updated      1169
Android Ver       1169
dtype: int64
In [41]:
df=df.dropna()
In [42]:
df.nunique()
Out[42]:
App               7025
Category            33
Rating              39
Reviews           4295
Size               412
Installs            19
Type                 2
Price               68
Content Rating       6
Genres             111
Last Updated      1279
Android Ver         31
dtype: int64
In [43]:
df[df.isnull().any(axis=1)].count()
Out[43]:
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Android Ver       0
dtype: int64

Organizing the data

Hypotheses (check whether the intuition is correct) [Brainstorm]

  • Free apps have more Installs.
  • Ratings vary by category: more niche categories have a different distribution of ratings and reviews than the rest, because their users have specific knowledge that makes them either more critical or more appreciative. (Work out how to test this hypothesis.)
  • Whether an app is free depends on the category: categories with more competition have more free apps.
  • Paid apps in categories dominated by free apps have very few Installs (people are more critical of them).
  • Lighter apps within the same category have more downloads and/or better ratings.
  • Apps that require newer Android versions have worse ratings, since many users cannot run them properly.
  • Treat each category as an industry, in the sense that each has its own profitability and variation in profitability (see: https://aprendeingenieria.com/evidencia-cientifica-estrategia-rentabilidad/). Profitability can be a function of Rating and number of Installs (for example, the product of the two, with the rating normalized to 1 at the average). Every app must earn money somehow, whether through advertising or by being paid.

  • Analyze the information by separating what is decided from what is an outcome.

    • Decisions: app size, last update, Android Ver, category, genres, Content Rating and price
    • Outcomes: Installs, Rating, Reviews

Interesting idea 1:

  • Ratings above the average or median rating (to be decided from the data) should get a positive index y. At the average y = 0, and below the average y is negative. Here y is the metric:
    • y = (app_rating - category_mean_rating) * (1 + x).
  • In turn, x is:
    • 0.2 if the app has more reviews than the average number of reviews in its category, and -0.2 otherwise.
  • Note that x captures the consensus around the app's rating, and y indicates quality. An app with many reviews and a good rating is one with a consensus that it is good (relative to others in its category). Finally, this quality component is multiplied by Installs: Installs * (5 + y) / 5. (5 is added to y so that the factor is always positive, and the result is divided by 5 to "normalize" it, since an average app has y = 0.) The intuition behind this metric rests on the assumption that the value an app generates is given by its perceived quality (which correlates with usage and with how long the app stays on the phone) and by the number of users who download it. (Naturally, this fits free, ad-supported apps best.)
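The metric described in this list can be sketched directly from its definition. The function names and the example numbers below are hypothetical; only the formulas come from the text.

```python
def quality_index(app_rating, category_mean_rating, app_reviews, category_mean_reviews):
    """y = (app_rating - category_mean_rating) * (1 + x), with x = +0.2 or -0.2."""
    x = 0.2 if app_reviews > category_mean_reviews else -0.2
    return (app_rating - category_mean_rating) * (1 + x)

def app_value(installs, y):
    """Installs scaled by the quality factor (5 + y) / 5."""
    return installs * (5 + y) / 5

# Hypothetical app: rated above its category mean, with more reviews than average.
y_good = quality_index(4.6, 4.2, app_reviews=5000, category_mean_reviews=1000)
value_good = app_value(1_000_000, y_good)

# Hypothetical average app: y is 0, so its value equals its installs.
y_avg = quality_index(4.2, 4.2, app_reviews=500, category_mean_reviews=1000)
value_avg = app_value(1000, y_avg)

print(y_good, value_good, y_avg, value_avg)
```

Since ratings lie between 1 and 5, y can be no lower than (1 - 5) * 1.2 = -4.8, so the factor (5 + y) / 5 stays positive as the text intends.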

Interesting idea 2: Before building an app in a category, it is worth knowing which apps in it are the most popular and which are the most commented on by users. The first is simply the app with the most Installs in each category. The second is the Reviews/Installs ratio, computed for each app. Apps with the highest values of this metric are the most "controversial" ones, since a download has a high rate of turning into a review. Here we will only work with the raw number of reviews, and the most-reviewed apps will be taken as the typical apps of each category.
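Both variables from this idea can be computed with a groupby. The sketch below uses a hypothetical `toy` frame; `reviews_per_install` is a column name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned table (not real data).
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Reviews": [500, 2000, 30, 10],
    "Installs": [10000, 100000, 100, 1000],
})

# Share of downloads that turned into a review.
toy["reviews_per_install"] = toy["Reviews"] / toy["Installs"]

def top_app_per_category(frame, column):
    """Return the app with the largest value of `column` in each category."""
    # idxmax gives the row label of the maximum within each group.
    rows = frame.groupby("Category")[column].idxmax()
    return frame.loc[rows, "App"].tolist()

most_installed = top_app_per_category(toy, "Installs")
most_reviewed = top_app_per_category(toy, "Reviews")
highest_ratio = top_app_per_category(toy, "reviews_per_install")

print(most_installed, most_reviewed, highest_ratio)
```

In this toy example the three rankings disagree, which illustrates why the choice between raw reviews and the ratio matters.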

In [44]:
df.describe()  # the median rating is 4.3
Out[44]:
Rating Reviews Size Installs Price
count 7025.000000 7.025000e+03 7025.000000 7.025000e+03 7025.000000
mean 4.160541 1.448170e+05 21.758756 4.469479e+06 1.173694
std 0.559203 1.024141e+06 22.728166 2.714153e+07 18.200187
min 1.000000 1.000000e+00 0.008500 1.000000e+00 0.000000
25% 4.000000 8.400000e+01 4.900000 1.000000e+04 0.000000
50% 4.300000 1.546000e+03 13.000000 1.000000e+05 0.000000
75% 4.500000 2.657200e+04 31.000000 1.000000e+06 0.000000
max 5.000000 4.489172e+07 100.000000 1.000000e+09 400.000000
In [45]:
number_of_apps_in_Category_free = df[df['Type']=='Free'].Category.value_counts().sort_values(ascending=False)
number_of_apps_in_Category_free=number_of_apps_in_Category_free.to_frame()
number_of_apps_in_Category_free.columns=['Num_APP_free']

number_of_apps_in_Category_paid = df[df['Type']=='Paid'].Category.value_counts().sort_values(ascending=False)
number_of_apps_in_Category_paid=number_of_apps_in_Category_paid.to_frame()
number_of_apps_in_Category_paid.columns=['Num_APP_paid']



#competence = number_of_apps_in_Category_paid.merge(number_of_apps_in_Category_free, on='Category', how='right')
In [46]:
#number_of_apps_in_Category_free
In [47]:
paid_vs_free= number_of_apps_in_Category_free.merge(number_of_apps_in_Category_paid, how='outer',  left_index=True, right_index=True)
#free_vs_paid
In [48]:
paid_vs_free=paid_vs_free.fillna(0)
#paid_vs_free
In [49]:
paid_vs_free=paid_vs_free.assign(paid_vs_free= lambda x: (x.Num_APP_paid)/(x.Num_APP_free))
In [50]:
paid_vs_free.sort_values(by=['paid_vs_free'], ascending=False)
Out[50]:
Num_APP_free Num_APP_paid paid_vs_free
PERSONALIZATION 213 61.0 0.286385
MEDICAL 212 54.0 0.254717
WEATHER 44 6.0 0.136364
COMMUNICATION 170 18.0 0.105882
FAMILY 1368 144.0 0.105263
SPORTS 201 20.0 0.099502
GAME 758 74.0 0.097625
TOOLS 571 55.0 0.096322
PHOTOGRAPHY 191 13.0 0.068063
PRODUCTIVITY 209 14.0 0.066986
LIFESTYLE 253 16.0 0.063241
ART_AND_DESIGN 56 3.0 0.053571
FINANCE 245 13.0 0.053061
BOOKS_AND_REFERENCE 134 7.0 0.052239
EDUCATION 84 4.0 0.047619
MAPS_AND_NAVIGATION 90 4.0 0.044444
TRAVEL_AND_LOCAL 135 6.0 0.044444
HEALTH_AND_FITNESS 183 8.0 0.043716
BUSINESS 214 8.0 0.037383
PARENTING 43 1.0 0.023256
DATING 120 2.0 0.016667
ENTERTAINMENT 63 1.0 0.015873
FOOD_AND_DRINK 71 1.0 0.014085
SHOPPING 144 2.0 0.013889
NEWS_AND_MAGAZINES 152 2.0 0.013158
SOCIAL 154 2.0 0.012987
VIDEO_PLAYERS 111 1.0 0.009009
COMICS 47 0.0 0.000000
LIBRARIES_AND_DEMO 61 0.0 0.000000
AUTO_AND_VEHICLES 63 0.0 0.000000
BEAUTY 37 0.0 0.000000
EVENTS 38 0.0 0.000000
HOUSE_AND_HOME 50 0.0 0.000000

Show this table on the site: This table is sorted by the ratio of paid to free apps in each category. Although whether an app is paid depends on the company's business model, it is important to know the market before making that decision. If you want to build an app, check this table to see how the other apps in that category behave. (https://aprendeingenieria.com/modelos-de-negocio-y-estrategia/)
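The same table could be built in fewer steps with `pd.crosstab`, which counts the free and paid apps per category and fills missing combinations with 0. This is a sketch on a hypothetical `toy` frame, not the notebook's own method:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned table (not real data).
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "COMICS"],
    "Type": ["Free", "Free", "Paid", "Free", "Paid", "Free"],
})

# One row per category, one column per Type, cells are app counts.
counts = pd.crosstab(toy["Category"], toy["Type"])

# Ratio of paid to free apps, sorted from highest to lowest.
counts["paid_vs_free"] = counts["Paid"] / counts["Free"]
counts = counts.sort_values("paid_vs_free", ascending=False)

print(counts)
```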

In [51]:
df.columns
Out[51]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Android Ver'],
      dtype='object')
In [52]:
Rating_promedio=df[['Category','Rating']].groupby('Category').mean()
Rating_promedio=Rating_promedio.sort_values(by=['Rating'], ascending=False)
Rating_promedio.columns=['Rating_promedio']
Rating_promedio
Out[52]:
Rating_promedio
Category
EVENTS 4.478947
EDUCATION 4.373864
ART_AND_DESIGN 4.361017
PARENTING 4.347727
PERSONALIZATION 4.324453
BOOKS_AND_REFERENCE 4.322695
BEAUTY 4.291892
SOCIAL 4.257692
WEATHER 4.242000
GAME 4.235697
SHOPPING 4.213014
LIBRARIES_AND_DEMO 4.203279
SPORTS 4.200905
HEALTH_AND_FITNESS 4.191099
FAMILY 4.179497
COMICS 4.168085
MEDICAL 4.162406
ENTERTAINMENT 4.154688
AUTO_AND_VEHICLES 4.147619
NEWS_AND_MAGAZINES 4.143506
PRODUCTIVITY 4.132735
HOUSE_AND_HOME 4.128000
PHOTOGRAPHY 4.114216
FOOD_AND_DRINK 4.109722
FINANCE 4.104651
BUSINESS 4.096396
LIFESTYLE 4.089963
COMMUNICATION 4.076596
VIDEO_PLAYERS 4.021429
TRAVEL_AND_LOCAL 4.011348
MAPS_AND_NAVIGATION 4.008511
TOOLS 4.005911
DATING 3.963934
In [53]:
mas_reviews=df.sort_values(by=['Reviews'], ascending=False) #df.drop_duplicates(subset='App', inplace=True)
mas_reviews.drop_duplicates(subset='Category', keep='first',inplace=True)
In [54]:
mas_reviews  # apps with the most reviews
Out[54]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Android Ver
1670 Clash of Clans GAME 4.6 44891723 98.0 100000000 Free 0.0 Everyone 10+ Strategy July 15, 2018 4.1 and up
378 UC Browser - Fast Download Private & Secure COMMUNICATION 4.5 17712922 40.0 500000000 Free 0.0 Teen Communication August 2, 2018 4.0 and up
8896 DU Battery Saver - Battery Charger & Battery Life TOOLS 4.5 13479633 14.0 100000000 Free 0.0 Everyone Tools June 5, 2018 4.0 and up
3975 Hay Day FAMILY 4.5 10053186 94.0 100000000 Free 0.0 Everyone Casual June 28, 2018 4.0.3 and up
4688 VivaVideo - Video Editor & Photo Movie VIDEO_PLAYERS 4.6 9879473 40.0 100000000 Free 0.0 Teen Video Players & Editors August 4, 2018 4.1 and up
2938 PicsArt Photo Studio: Collage Maker & Pic Editor PHOTOGRAPHY 4.5 7594559 34.0 100000000 Free 0.0 Teen Photography August 6, 2018 4.0.3 and up
3360 CM Launcher 3D - Theme, Wallpapers, Efficient PERSONALIZATION 4.6 6702776 17.0 100000000 Free 0.0 Teen Personalization August 3, 2018 4.0.3 and up
2655 Wish - Shopping Made Fun SHOPPING 4.5 6210998 15.0 100000000 Free 0.0 Everyone Shopping August 3, 2018 4.1 and up
3945 Tik Tok - including musical.ly SOCIAL 4.4 5637451 59.0 100000000 Free 0.0 Teen Social August 3, 2018 4.1 and up
3469 ES File Explorer File Manager PRODUCTIVITY 4.6 5383985 16.0 100000000 Free 0.0 Everyone Productivity August 3, 2018 4.0 and up
8445 FIFA Soccer SPORTS 4.2 3909032 51.0 100000000 Free 0.0 Everyone Sports July 31, 2018 4.1 and up
4587 Tinder LIFESTYLE 4.0 2789775 68.0 100000000 Free 0.0 Mature 17+ Lifestyle August 2, 2018 4.4 and up
4725 Weather & Clock Widget for Android WEATHER 4.4 2371543 11.0 50000000 Free 0.0 Everyone Weather June 4, 2018 4.0.3 and up
874 Talking Angela ENTERTAINMENT 3.7 1828284 52.0 100000000 Free 0.0 Everyone Entertainment July 12, 2018 4.1 and up
3828 GPS Navigation & Offline Maps Sygic MAPS_AND_NAVIGATION 4.4 1421884 33.0 50000000 Free 0.0 Everyone Maps & Navigation July 26, 2018 4.0.3 and up
1173 Chase Mobile FINANCE 4.6 1374549 32.0 10000000 Free 0.0 Everyone Finance July 23, 2018 5.0 and up
194 OfficeSuite : Free Office + PDF Editor BUSINESS 4.3 1002861 35.0 100000000 Free 0.0 Everyone Business August 2, 2018 4.1 and up
3736 Google News NEWS_AND_MAGAZINES 3.9 877635 13.0 1000000000 Free 0.0 Teen News & Magazines August 1, 2018 4.4 and up
3122 GasBuddy: Find Cheap Gas TRAVEL_AND_LOCAL 4.6 751551 42.0 10000000 Free 0.0 Mature 17+ Travel & Local August 2, 2018 4.4 and up
7229 Pregnancy Tracker & Countdown to Baby Due Date PARENTING 4.7 658087 62.0 10000000 Free 0.0 Everyone Parenting May 24, 2018 5.0 and up
1183 Tastely FOOD_AND_DRINK 4.7 611136 19.0 10000000 Free 0.0 Everyone Food & Drink July 13, 2018 4.0.3 and up
1361 Period Tracker Clue: Period and Ovulation Tracker HEALTH_AND_FITNESS 4.8 570242 20.0 10000000 Free 0.0 Everyone Health & Fitness August 2, 2018 4.1 and up
5323 Al Quran Indonesia BOOKS_AND_REFERENCE 4.8 445756 16.0 10000000 Free 0.0 Everyone Books & Reference May 15, 2018 4.0 and up
1446 Zillow: Find Houses for Sale & Apartments for ... HOUSE_AND_HOME 4.5 417907 34.0 10000000 Free 0.0 Everyone House & Home August 1, 2018 4.4 and up
718 Math Tricks EDUCATION 4.5 342918 8.1 10000000 Free 0.0 Everyone Education July 29, 2018 4.0 and up
10729 MX Player Codec (ARMv7) LIBRARIES_AND_DEMO 4.3 332083 6.3 10000000 Free 0.0 Everyone Libraries & Demo April 23, 2018 4.0 and up
483 OkCupid Dating DATING 4.1 285726 15.0 10000000 Free 0.0 Mature 17+ Dating July 30, 2018 4.1 and up
72 Android Auto - Maps, Media, Messaging & Voice AUTO_AND_VEHICLES 4.2 271920 16.0 10000000 Free 0.0 Teen Auto & Vehicles July 11, 2018 5.0 and up
304 Manga Rock - Best Manga Reader COMICS 4.4 238970 28.0 1000000 Free 0.0 Teen Comics July 9, 2018 5.0 and up
19 ibis Paint X ART_AND_DESIGN 4.6 224399 31.0 10000000 Free 0.0 Everyone Art & Design July 30, 2018 4.1 and up
2319 My Calendar - Period Tracker MEDICAL 4.7 156410 14.0 5000000 Free 0.0 Everyone Medical August 3, 2018 4.1 and up
99 ipsy: Makeup, Beauty, and Tips BEAUTY 4.9 49790 14.0 1000000 Free 0.0 Everyone Beauty November 9, 2017 4.1 and up
1005 Ticketmaster Event Tickets EVENTS 4.0 40113 36.0 5000000 Free 0.0 Everyone Events July 23, 2018 Varies with device

These are the relatively dominant apps in each category: they are widely known and, with a few exceptions (for example Talking Angela at 3.7), well rated.

In [55]:
mejor_rating=df[df['Reviews']>=10000].sort_values(by=['Rating'], ascending=False)  # consider only apps with many reviews
mejor_rating.drop_duplicates(subset='Category', keep='first',inplace=True)
mejor_rating
Out[55]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Android Ver
1260 Six Pack in 30 Days - Abs Workout HEALTH_AND_FITNESS 4.9 272337 13.000 10000000 Free 0.00 Everyone Health & Fitness June 21, 2018 4.2 and up
79 Tickets + PDA 2018 Exam AUTO_AND_VEHICLES 4.9 197136 38.000 1000000 Free 0.00 Everyone Auto & Vehicles July 15, 2018 4.1 and up
4975 Solitaire: Decked Out Ad Free GAME 4.9 37302 35.000 500000 Free 0.00 Everyone Card May 8, 2017 4.1 and up
7000 PixPanda - Color by Number Pixel Art Coloring ... FAMILY 4.9 55723 14.000 1000000 Free 0.00 Everyone Entertainment June 4, 2018 4.0.3 and up
99 ipsy: Makeup, Beauty, and Tips BEAUTY 4.9 49790 14.000 1000000 Free 0.00 Everyone Beauty November 9, 2017 4.1 and up
10254 FC Porto SPORTS 4.9 15883 21.000 100000 Free 0.00 Everyone Sports June 19, 2018 4.0.3 and up
712 Learn Japanese, Korean, Chinese Offline & Free EDUCATION 4.9 133136 26.000 1000000 Free 0.00 Everyone Education;Education July 20, 2018 4.2 and up
5323 Al Quran Indonesia BOOKS_AND_REFERENCE 4.8 445756 16.000 10000000 Free 0.00 Everyone Books & Reference May 15, 2018 4.0 and up
3848 GPS Speedometer and Odometer MAPS_AND_NAVIGATION 4.8 15865 3.300 1000000 Free 0.00 Everyone Maps & Navigation August 3, 2018 4.1 and up
4038 DU Recorder – Screen Recorder, Video Editor, Live VIDEO_PLAYERS 4.8 2588730 9.700 50000000 Free 0.00 Everyone Video Players & Editors July 30, 2018 5.0 and up
3658 Weather Live Pro WEATHER 4.8 17493 11.000 100000 Paid 4.49 Everyone Weather April 20, 2018 4.4 and up
1092 Even - organize your money, get paid early FINANCE 4.8 12304 21.000 100000 Free 0.00 Everyone Finance August 2, 2018 5.0 and up
6407 WebComics COMICS 4.8 33783 6.400 1000000 Free 0.00 Teen Comics July 28, 2018 4.1 and up
438 Should I Answer? COMMUNICATION 4.8 237468 8.800 1000000 Free 0.00 Everyone Communication July 26, 2018 4.0 and up
1647 Nature Sounds LIFESTYLE 4.8 28588 24.000 1000000 Free 0.00 Everyone Lifestyle March 1, 2018 4.0.3 and up
4292 KPOP Amino for K-Pop Entertainment SOCIAL 4.8 19047 63.000 100000 Free 0.00 Teen Social July 13, 2018 4.0.3 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39.000 100000 Paid 4.99 Everyone Business April 11, 2017 3.0 and up
3303 Calculator with Percent (Free) TOOLS 4.8 48211 7.400 1000000 Free 0.00 Everyone Tools November 18, 2017 4.1 and up
2803 FreePrints – Free Photos Delivered PHOTOGRAPHY 4.8 109500 37.000 1000000 Free 0.00 Everyone Photography August 2, 2018 4.1 and up
2333 Pregnancy Calculator and Tracker app MEDICAL 4.8 69126 61.000 1000000 Free 0.00 Everyone Medical June 1, 2018 4.1 and up
3141 Yoriza Pension - travel, lodging, pension, cam... TRAVEL_AND_LOCAL 4.8 17882 86.000 1000000 Free 0.00 Everyone Travel & Local August 3, 2018 5.0 and up
6213 ALL-IN-ONE PACKAGE TRACKING PRODUCTIVITY 4.7 167406 18.000 1000000 Free 0.00 Everyone Productivity July 23, 2018 4.0.3 and up
882 🔥 Football Wallpapers 4K | Full HD Backgrounds 😍 ENTERTAINMENT 4.7 11661 4.000 1000000 Free 0.00 Everyone Entertainment July 14, 2018 4.0.3 and up
1183 Tastely FOOD_AND_DRINK 4.7 611136 19.000 10000000 Free 0.00 Everyone Food & Drink July 13, 2018 4.0.3 and up
5335 Al Mayadeen NEWS_AND_MAGAZINES 4.7 13620 9.000 500000 Free 0.00 Everyone News & Magazines April 16, 2018 4.1 and up
7229 Pregnancy Tracker & Countdown to Baby Due Date PARENTING 4.7 658087 62.000 10000000 Free 0.00 Everyone Parenting May 24, 2018 5.0 and up
7538 CM Launcher 3D Pro💎 PERSONALIZATION 4.7 23802 0.173 100000 Paid 4.99 Everyone Personalization November 17, 2016 4.0 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.700 5000000 Free 0.00 Everyone Art & Design August 1, 2018 4.0.3 and up
2771 ASOS SHOPPING 4.7 181798 22.000 10000000 Free 0.00 Everyone Shopping July 30, 2018 4.4 and up
1466 Redfin Real Estate HOUSE_AND_HOME 4.6 36857 19.000 1000000 Free 0.00 Everyone House & Home July 25, 2018 5.0 and up
515 Dating for 50 plus Mature Singles – FINALLY DATING 4.6 13046 13.000 500000 Free 0.00 Mature 17+ Dating July 31, 2018 4.1 and up
1549 Cool Popular Ringtones 2018 🔥 LIBRARIES_AND_DEMO 4.5 60170 28.000 1000000 Free 0.00 Everyone Libraries & Demo July 13, 2018 5.0 and up
1011 SeatGeek – Tickets to Sports, Concerts, Broadway EVENTS 4.4 15558 26.000 1000000 Free 0.00 Everyone Events August 3, 2018 5.0 and up

This table is interesting because it shows that in the EVENTS category the best app has a rating of only 4.4 (a low rating). Note the restriction to apps with at least 10,000 reviews, which ensures the rating is backed by many users. In other words, it could be worthwhile to enter this category with a new app.

In [56]:
df_review=df[df['Reviews']>=1000]

Rating_promedio2=df_review[['Category','Rating']].groupby('Category').mean()
Rating_promedio2=Rating_promedio2.sort_values(by=['Rating'], ascending=False)
Rating_promedio2.columns=['Rating_promedio']
Rating_promedio2
Out[56]:
Rating_promedio
Category
ART_AND_DESIGN 4.473913
EDUCATION 4.411429
BOOKS_AND_REFERENCE 4.408197
HEALTH_AND_FITNESS 4.388889
EVENTS 4.358333
PARENTING 4.355000
MEDICAL 4.353704
PERSONALIZATION 4.326515
SOCIAL 4.322222
FINANCE 4.321154
AUTO_AND_VEHICLES 4.318519
BEAUTY 4.307143
WEATHER 4.295349
PRODUCTIVITY 4.273950
GAME 4.271626
SHOPPING 4.257843
NEWS_AND_MAGAZINES 4.246269
BUSINESS 4.244776
SPORTS 4.241304
MAPS_AND_NAVIGATION 4.231111
FOOD_AND_DRINK 4.204255
FAMILY 4.202445
PHOTOGRAPHY 4.200709
TOOLS 4.193525
COMMUNICATION 4.187879
VIDEO_PLAYERS 4.183099
ENTERTAINMENT 4.175410
HOUSE_AND_HOME 4.140541
COMICS 4.132000
TRAVEL_AND_LOCAL 4.122727
LIBRARIES_AND_DEMO 4.104545
LIFESTYLE 4.086538
DATING 3.973333

Is this difference statistically significant?
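One possible way to approach this question is a Kruskal-Wallis test, which compares several groups without assuming normally distributed ratings. The sketch below uses synthetic ratings for three hypothetical categories; the choice of test and the 0.05 threshold are assumptions, not something the notebook specifies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ratings for three hypothetical categories, clipped to the 1-5 scale.
ratings_by_category = {
    "CATEGORY_A": np.clip(rng.normal(4.4, 0.3, 200), 1, 5),
    "CATEGORY_B": np.clip(rng.normal(4.2, 0.3, 200), 1, 5),
    "CATEGORY_C": np.clip(rng.normal(4.0, 0.3, 200), 1, 5),
}

# Null hypothesis: all groups come from the same distribution.
statistic, p_value = stats.kruskal(*ratings_by_category.values())
significant = p_value < 0.05

print(statistic, p_value, significant)
```

On the real data the groups could be taken as `[g['Rating'].values for _, g in df_review.groupby('Category')]`. A significant result would only say that at least one category differs, not which one.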

Visual analysis

In [57]:
df_short=df[df['Reviews']>=100]
df_fix=df_short.copy()
#df_short

# These content ratings are dropped because very few apps have them
df_fix=df_fix[df_fix['Content Rating'] != 'Adults only 18+']
df_fix=df_fix[df_fix['Content Rating'] != 'Unrated']
In [58]:
df_fix['installs']=df_fix['Installs'].apply(lambda x: np.log(x))
df_fix['reviews']=df_fix['Reviews'].apply(lambda x: np.log(x))
In [59]:
number_of_apps_in_category = df_short['Category'].value_counts().sort_values(ascending=False)
df_short.columns
Out[59]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Android Ver'],
      dtype='object')
In [60]:
number_of_apps_in_category
Out[60]:
FAMILY                 1099
GAME                    751
TOOLS                   430
PERSONALIZATION         188
SPORTS                  185
FINANCE                 183
LIFESTYLE               174
PHOTOGRAPHY             166
PRODUCTIVITY            155
HEALTH_AND_FITNESS      148
COMMUNICATION           132
SHOPPING                122
MEDICAL                 117
SOCIAL                  115
BUSINESS                113
NEWS_AND_MAGAZINES      104
TRAVEL_AND_LOCAL         98
BOOKS_AND_REFERENCE      94
DATING                   89
EDUCATION                85
VIDEO_PLAYERS            83
MAPS_AND_NAVIGATION      66
ENTERTAINMENT            64
FOOD_AND_DRINK           61
ART_AND_DESIGN           50
HOUSE_AND_HOME           47
WEATHER                  47
AUTO_AND_VEHICLES        46
COMICS                   40
LIBRARIES_AND_DEMO       38
PARENTING                33
BEAUTY                   24
EVENTS                   22
Name: Category, dtype: int64
In [61]:
number_of_apps_in_category = df_fix['Category'].value_counts()
number_of_apps_in_category


labels = number_of_apps_in_category.index
values = number_of_apps_in_category.values


fig1, ax1 = plt.subplots(figsize=(10, 15))
ax1.pie(values, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Category distribution", size = 20)
plt.show()
In [62]:
number_of_apps_for_type = df_fix['Type'].value_counts()
number_of_apps_for_type


labels = number_of_apps_for_type.index
values = number_of_apps_for_type.values


fig1, ax1 = plt.subplots(figsize=(10, 15))
ax1.pie(values, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Distribution of paid and free apps", size = 20)
plt.show()
In [63]:
g=sns.pairplot(df_fix,vars=['Rating', 'reviews', 'Size', 'installs'], hue='Type')

Analysis:

The non-trivial results from these plots were marked in bold.

Looking only at the diagonal of this plot:

Paid apps have far fewer installs and reviews (both attributes are log-transformed), but better ratings.

Looking at the off-diagonal plots:

Paid apps have a fairly high number of reviews given their installs.

In general, apps with better ratings have more downloads.

Paid and free apps have roughly the same size distribution in MB.
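The visual reading above could be cross-checked with a per-type summary. The sketch below compares medians for free and paid apps on a hypothetical `toy` frame whose values were chosen only to mirror the pattern described:

```python
import pandas as pd

# Hypothetical stand-in for df_fix (not real data).
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Rating": [4.0, 4.2, 4.4, 4.3, 4.5, 4.7],
    "Reviews": [1000, 5000, 20000, 200, 400, 900],
    "Installs": [100000, 500000, 1000000, 1000, 5000, 10000],
})

# Reviews relative to installs, to compare apps of very different reach.
toy["reviews_per_install"] = toy["Reviews"] / toy["Installs"]

# Median of each measure for free and paid apps.
medians = toy.groupby("Type")[["Rating", "Installs", "reviews_per_install"]].median()

print(medians)
```

Medians are used rather than means because installs and reviews are heavily skewed, as the need for a log transform already suggests.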

In [64]:
number_of_apps_in_content_rating = df_fix['Content Rating'].value_counts()
number_of_apps_in_content_rating


labels = number_of_apps_in_content_rating.index
values = number_of_apps_in_content_rating.values




fig1, ax1 = plt.subplots(figsize=(10, 15))
ax1.pie(values, autopct='%1.1f%%',
         startangle=90)
ax1.axis('equal')
plt.title("Content Rating distribution", size = 20)

ax1.legend(labels,
          title="Content Rating",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.show()
In [65]:
a=sns.pairplot(df_fix,vars=['Rating', 'reviews', 'Size', 'installs'], hue='Content Rating')

(Given how few Adults only 18+ and Unrated apps there are, they are excluded from the analysis.) (Note that the colors for Mature 17+ and Everyone 10+ were swapped.)

From the diagonal of the plot: Everyone 10+ apps are distributed most differently from the rest. They have higher ratings and more reviews for the same number of installs than apps with other content ratings. By contrast, Everyone apps have the fewest reviews for the same number of installs and are the lightest apps.

A closer look at the Everyone 10+ content rating shows that it is heavily skewed toward games.

In [66]:
number_of_apps_in_category = df_fix['Category'].value_counts().sort_values(ascending=False)
#number_of_apps_in_category
In [67]:
number_of_apps_in_category = df[df['Content Rating']=='Everyone 10+'].Category.value_counts().sort_values(ascending=False)
#number_of_apps_in_category

Apps with content rating Everyone 10+ are mainly games, so it is to be expected that they take up more MB than the rest.
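
The counts in the cells above are commented out, so the claim is not visible in the output. A minimal sketch of how the category share and the size difference could be checked, using a made-up `toy` frame (the values are illustrative assumptions, not results from the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "Content Rating": ["Everyone 10+"] * 4 + ["Everyone"] * 4,
    "Category": ["GAME", "GAME", "GAME", "FAMILY",
                 "TOOLS", "FAMILY", "BUSINESS", "GAME"],
    "Size": [60.0, 45.0, 80.0, 30.0, 5.0, 20.0, 10.0, 40.0],
})

# Share of each category inside one content rating (fractions sum to 1).
mask = toy["Content Rating"] == "Everyone 10+"
share = toy.loc[mask, "Category"].value_counts(normalize=True)
print(share)

# Median size in MB per content rating.
median_size = toy.groupby("Content Rating")["Size"].median()
print(median_size)
```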

In [68]:
number_of_apps_in_category = df[df['Content Rating']=='Everyone'].Category.value_counts().sort_values(ascending=False)
#number_of_apps_in_category
In [69]:
df_fix1=df_fix[df_fix['Price']!=0]
In [70]:
p = sns.stripplot(x="Price", y="Content Rating", data=df_fix1, jitter=True, linewidth=1)
In [71]:
p = sns.stripplot(x="Price", y="Content Rating", data=df_fix1[df_fix1['Price']<100], jitter=True, linewidth=1)

Although Everyone 10+ and Mature 17+ have almost the same number of apps, Everyone 10+ has many more paid apps.
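
The strip plot shows this visually. A cross-tabulation would give the exact counts. The sketch below uses a made-up `toy` frame, so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical toy data; in the notebook the frame would be df_fix.
toy = pd.DataFrame({
    "Content Rating": ["Everyone 10+", "Everyone 10+", "Everyone 10+",
                       "Mature 17+", "Mature 17+", "Mature 17+"],
    "Type": ["Paid", "Paid", "Free", "Free", "Free", "Paid"],
})

# Rows: content rating, columns: Free/Paid, cells: number of apps.
counts = pd.crosstab(toy["Content Rating"], toy["Type"])

# Fraction of paid apps within each content rating.
paid_share = counts["Paid"] / counts.sum(axis=1)
print(counts)
print(paid_share)
```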

In [72]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=df_fix1[df_fix1['Price']<100], jitter=True, linewidth=1)

Now that you know this price distribution, what is the best price for your app?

In [73]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Size", y="Category", data=df_fix1, jitter=True, linewidth=1)

ANOVA test

In [74]:
import scipy.stats as stats
In [75]:
# Build each mask from df_fix itself so the mask and the frame share the same index.
anova = stats.f_oneway(df_fix.loc[df_fix.Category == 'TOOLS', 'Rating'],
                       df_fix.loc[df_fix.Category == 'FAMILY', 'Rating'],
                       df_fix.loc[df_fix.Category == 'GAME', 'Rating'])

print(anova)
F_onewayResult(statistic=24.343855636499324, pvalue=3.4601258117106465e-11)
In [76]:
# Non-parametric counterpart of the ANOVA above, on the same three groups.
kruskal = stats.kruskal(df_fix.loc[df_fix.Category == 'TOOLS', 'Rating'],
                        df_fix.loc[df_fix.Category == 'FAMILY', 'Rating'],
                        df_fix.loc[df_fix.Category == 'GAME', 'Rating'])

print(kruskal)
KruskalResult(statistic=32.744794386451225, pvalue=7.754579334441732e-08)
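
Both p-values are far below 0.05, so the mean and the distribution of Rating differ among the three categories. With groups this large, a tiny p-value does not say how big the difference is. One common complement is eta squared, the share of total variance explained by group membership. The sketch below computes it on synthetic groups; the data and group means are assumptions for illustration, not the notebook's ratings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical ratings for three groups; not the notebook's data.
groups = [
    rng.normal(4.0, 0.5, 300),
    rng.normal(4.2, 0.5, 300),
    rng.normal(4.3, 0.5, 300),
]

f_stat, p_anova = stats.f_oneway(*groups)
h_stat, p_kruskal = stats.kruskal(*groups)

# Eta squared = between-group sum of squares / total sum of squares.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F={f_stat:.2f}, p={p_anova:.2e}, eta^2={eta_squared:.3f}")
```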

Understanding the problem a bit better with ML

In [77]:
df_fix.describe()
Out[77]:
Rating Reviews Size Installs Price installs reviews
count 5166.000000 5.166000e+03 5166.000000 5.166000e+03 5166.000000 5166.000000 5166.000000
mean 4.184340 1.969048e+05 24.586215 6.076344e+06 1.083802 12.908475 8.942595
std 0.439476 1.190009e+06 23.836503 3.149674e+07 18.079798 2.551472 2.646610
min 1.600000 1.000000e+02 0.008500 5.000000e+02 0.000000 6.214608 4.605170
25% 4.000000 8.070000e+02 6.200000 1.000000e+05 0.000000 11.512925 6.693324
50% 4.300000 7.171000e+03 16.000000 5.000000e+05 0.000000 13.122363 8.877796
75% 4.500000 5.264025e+04 36.000000 5.000000e+06 0.000000 15.424948 10.871236
max 5.000000 4.489172e+07 100.000000 1.000000e+09 400.000000 20.723266 17.619764
In [78]:
df_fix['rating']=0
df_fix['rating']=(df_fix['Rating']>=4.6)*1
df_fix.rating.value_counts()
Out[78]:
0    4264
1     902
Name: rating, dtype: int64
In [79]:
y=df_fix['rating'].copy()
In [80]:
category=pd.get_dummies(df_fix.Category)
genres=pd.get_dummies(df_fix.Genres)
content_rating=pd.get_dummies(df_fix['Content Rating'])

obj=[df_fix,category,content_rating]
df_fix=pd.concat(obj,axis=1)
df_fix.head(3)
Out[80]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres ... SOCIAL SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS WEATHER Everyone Everyone 10+ Mature 17+ Teen
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000 Free 0.0 Everyone Art & Design ... 0 0 0 0 0 0 1 0 0 0
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000 Free 0.0 Everyone Art & Design;Pretend Play ... 0 0 0 0 0 0 1 0 0 0
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000 Free 0.0 Everyone Art & Design ... 0 0 0 0 0 0 1 0 0 0

3 rows × 52 columns

In [81]:
df_fix.columns
Out[81]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Android Ver',
       'installs', 'reviews', 'rating', 'ART_AND_DESIGN', 'AUTO_AND_VEHICLES',
       'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FAMILY', 'FINANCE',
       'FOOD_AND_DRINK', 'GAME', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'MAPS_AND_NAVIGATION', 'MEDICAL',
       'NEWS_AND_MAGAZINES', 'PARENTING', 'PERSONALIZATION', 'PHOTOGRAPHY',
       'PRODUCTIVITY', 'SHOPPING', 'SOCIAL', 'SPORTS', 'TOOLS',
       'TRAVEL_AND_LOCAL', 'VIDEO_PLAYERS', 'WEATHER', 'Everyone',
       'Everyone 10+', 'Mature 17+', 'Teen'],
      dtype='object')
In [82]:
number_of_apps_in_category
Out[82]:
FAMILY                 1168
TOOLS                   619
GAME                    410
FINANCE                 254
MEDICAL                 253
LIFESTYLE               243
PERSONALIZATION         241
PRODUCTIVITY            220
BUSINESS                218
SPORTS                  203
PHOTOGRAPHY             191
HEALTH_AND_FITNESS      171
COMMUNICATION           168
TRAVEL_AND_LOCAL        138
BOOKS_AND_REFERENCE     128
SHOPPING                125
NEWS_AND_MAGAZINES      102
VIDEO_PLAYERS            95
MAPS_AND_NAVIGATION      91
EDUCATION                82
FOOD_AND_DRINK           66
AUTO_AND_VEHICLES        61
LIBRARIES_AND_DEMO       61
SOCIAL                   59
ART_AND_DESIGN           55
HOUSE_AND_HOME           48
WEATHER                  47
PARENTING                42
BEAUTY                   34
EVENTS                   32
ENTERTAINMENT            26
COMICS                   23
DATING                    6
Name: Category, dtype: int64
In [83]:
features=['Size','Price','FAMILY', 'TOOLS','GAME', 'Everyone',
       'Everyone 10+', 'Mature 17+', 'Teen']
In [84]:
X=df_fix[features].copy()
In [85]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

The goal is not to predict, but to understand which apps are the best given certain attributes. A decision tree is used to separate the best apps from the worst by reducing the heterogeneity of the groups.
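
When the tree is used for description rather than prediction, the split rules are what matters. As an alternative to rendering a Graphviz image, the rules can be printed as text with `sklearn.tree.export_text`. This assumes scikit-learn 0.21 or later, which may be newer than the version used in this notebook. The data and the labelling rule below are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical features; names mirror two of the notebook's features.
X_toy = pd.DataFrame({
    "Size": rng.uniform(1, 100, 400),
    "Price": rng.choice([0.0, 0.99, 4.99], 400),
})
# Made-up labelling rule: cheap and large apps count as "top rated".
y_toy = ((X_toy["Price"] < 1.0) & (X_toy["Size"] > 50)).astype(int)

clf = DecisionTreeClassifier(max_leaf_nodes=15, min_samples_leaf=8,
                             random_state=0)
clf.fit(X_toy, y_toy)

# One line per node: the threshold tested and, at leaves, the majority class.
rules = export_text(clf, feature_names=list(X_toy.columns))
print(rules)
```

Each root-to-leaf path in the printed text corresponds to one segment of apps, which is the reading applied to the tree images in this section.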

In [86]:
rating_classifier = tree.DecisionTreeClassifier(max_leaf_nodes=15, random_state=0, min_samples_leaf=8)
rating_classifier.fit(X, y)
# Write the fitted tree to a .dot file and render it as PNG (requires the Graphviz `dot` binary).
export_graphviz(rating_classifier, 'tree2.dot', rounded=True,
                feature_names=features,
                class_names=['regulares', 'buenisimos'], filled=True)
call(['dot', '-Tpng', 'tree2.dot', '-o', 'tree2.png', '-Gdpi=400'])
from IPython.display import Image
Image('tree2.png')
Out[86]:

Following the tree, the best apps are those priced below 1.345 dollars, weighing between 61 and 63 MB, and rated Teen.

In [87]:
rating_classifier = tree.DecisionTreeClassifier(max_leaf_nodes=20, random_state=0, min_samples_leaf=5)
rating_classifier.fit(X, y)
# Same export as before, now with a slightly larger tree (20 leaves, at least 5 samples per leaf).
export_graphviz(rating_classifier, 'tree2.dot', rounded=True,
                feature_names=features,
                class_names=['regulares', 'buenisimos'], filled=True)
call(['dot', '-Tpng', 'tree2.dot', '-o', 'tree2.png', '-Gdpi=400'])
from IPython.display import Image
Image('tree2.png')
Out[87]:
  • Priced below 1.345 dollars, weighing between 61 and 63 MB, and rated Teen (these are mainly games, see below).
  • Priced above 1.34 but below 19.49 dollars, neither games nor family apps, and weighing between 1.25 and 25 MB. (Note that these two are the categories with the most apps.)
In [88]:
df_fix.columns
Out[88]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Android Ver',
       'installs', 'reviews', 'rating', 'ART_AND_DESIGN', 'AUTO_AND_VEHICLES',
       'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FAMILY', 'FINANCE',
       'FOOD_AND_DRINK', 'GAME', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'MAPS_AND_NAVIGATION', 'MEDICAL',
       'NEWS_AND_MAGAZINES', 'PARENTING', 'PERSONALIZATION', 'PHOTOGRAPHY',
       'PRODUCTIVITY', 'SHOPPING', 'SOCIAL', 'SPORTS', 'TOOLS',
       'TRAVEL_AND_LOCAL', 'VIDEO_PLAYERS', 'WEATHER', 'Everyone',
       'Everyone 10+', 'Mature 17+', 'Teen'],
      dtype='object')
In [89]:
df_fix[df_fix['Content Rating']=='Teen'].Category.value_counts()
Out[89]:
GAME                   254
FAMILY                 162
SOCIAL                  49
ENTERTAINMENT           31
NEWS_AND_MAGAZINES      20
SHOPPING                17
COMICS                  16
VIDEO_PLAYERS           12
PERSONALIZATION         11
SPORTS                  10
COMMUNICATION           10
LIFESTYLE                9
HEALTH_AND_FITNESS       9
BOOKS_AND_REFERENCE      6
PHOTOGRAPHY              6
ART_AND_DESIGN           3
FINANCE                  3
TOOLS                    3
WEATHER                  2
BEAUTY                   2
EVENTS                   2
FOOD_AND_DRINK           2
MEDICAL                  2
HOUSE_AND_HOME           2
MAPS_AND_NAVIGATION      2
TRAVEL_AND_LOCAL         2
BUSINESS                 1
AUTO_AND_VEHICLES        1
DATING                   1
EDUCATION                1
PARENTING                1
Name: Category, dtype: int64

Still to do: try clustering, working with reviews, rating, size, and price. Once the clusters are available, check which categories or genres each cluster falls into.

In [90]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from itertools import cycle, islice
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
In [91]:
Labels2=['Rating', 'Reviews', 'Size', 'Installs','Price']
In [92]:
X2=df_fix[Labels2].copy()
In [93]:
X2=StandardScaler().fit_transform(X2)
C:\Users\rfuen\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
C:\Users\rfuen\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
In [94]:
def distancia_centroid(X2, i):
    kmeans = KMeans(n_clusters=i)
    model = kmeans.fit(X2)
    distance=model.inertia_

    return distance
In [95]:
Distancias=[]
n_clusters=[]
for i in range(1,30):
    distancia=distancia_centroid(X2, i)
    Distancias.append(distancia)
    n_clusters.append(i)
In [96]:
loss1=pd.DataFrame(data=n_clusters, columns= ['n_clusters'])
loss2=pd.DataFrame(data=Distancias, columns= ['distancia'])
obj=[loss1,loss2]
loss=pd.concat(obj, axis=1)

loss
Out[96]:
n_clusters distancia
0 1 25830.000000
1 2 20716.666899
2 3 16619.753101
3 4 12002.394724
4 5 9161.028934
5 6 7275.552729
6 7 6227.129408
7 8 5440.355896
8 9 4805.269775
9 10 4272.285271
10 11 3875.953004
11 12 3462.844898
12 13 3147.646244
13 14 2908.347904
14 15 2725.784253
15 16 2503.929097
16 17 2285.294452
17 18 2145.501813
18 19 1995.904834
19 20 1871.343728
20 21 1791.103288
21 22 1702.706756
22 23 1649.586018
23 24 1556.935489
24 25 1513.429037
25 26 1430.427236
26 27 1368.626633
27 28 1322.782759
28 29 1273.817443
In [97]:
b = sns.catplot(x="n_clusters", y="distancia",
              data=loss, kind="point",
               height=5, aspect=1.5);

The elbow of the curve suggests 4 or 5 clusters among the apps.
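
The elbow reading can be reproduced on synthetic data. One sanity check is worth noting: on standardized features, the inertia for a single cluster equals n_samples × n_features, which is why the first row of the table above is 25830 = 5166 × 5. The sketch below uses `make_blobs` data and a fixed `random_state`, both of which are assumptions added for reproducibility and not part of the original cells.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data: 300 points, 5 features, 4 true groups (illustrative only).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X_std = StandardScaler().fit_transform(X_raw)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    inertias.append(km.inertia_)

# With k=1 the centroid is the origin and every feature has unit variance,
# so the inertia is n_samples * n_features = 300 * 5.
print(inertias[0])

# Drop in inertia gained by each extra cluster; the elbow is where it flattens.
drops = -np.diff(inertias)
print(np.round(drops, 1))
```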

In [98]:
kmeans = KMeans(n_clusters=5, n_init=15,max_iter=1000)
model = kmeans.fit(X2)
print("model\n", model)
model
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
    n_clusters=5, n_init=15, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [99]:
centers = model.cluster_centers_
model.inertia_
Out[99]:
9161.038497679896
In [100]:
def pd_centers(featuresUsed, centers):
	colNames = list(featuresUsed)
	colNames.append('prediction')

	# Zip with a column called 'prediction' (index)
	Z = [np.append(A, index) for index, A in enumerate(centers)]

	# Convert to pandas data frame for plotting
	P = pd.DataFrame(Z, columns=colNames)
	P['prediction'] = P['prediction'].astype(int)
	return P
In [101]:
def parallel_plot(data):
	my_colors = list(islice(cycle(['b', 'y', 'g', 'r', 'k']), None, len(data)))
	plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+15])
	parallel_coordinates(data, 'prediction', color = my_colors, marker='o')
In [102]:
P = pd_centers(Labels2, centers)
P
Out[102]:
Rating Reviews Size Installs Price prediction
0 0.416318 -0.098963 -0.440064 -0.088879 -0.042990 0
1 -0.626373 -0.164803 -0.779184 -0.192315 21.512672 1
2 0.241208 0.294477 1.599036 0.211462 -0.047579 2
3 -1.556863 -0.153248 -0.376211 -0.148027 -0.052983 3
4 0.471805 11.760428 1.216618 16.212463 -0.059951 4
In [103]:
parallel_plot(P)
  • Segment 0: The segment with the most apps, with excellent ratings. They are light and have many reviews for their number of installs. (Mainly family, tools, and game apps.)
  • Segment 1: Very few apps, very expensive and with poor ratings. (There are 4 finance apps, 4 lifestyle apps, and 3 family apps.)
  • Segment 2: The second-largest segment. These are heavy apps with intermediate ratings. (Mainly game, family, and sports apps.)
  • Segment 3: The worst-rated apps, with few reviews, intermediate size, and free or very cheap. (Same category distribution as segment 0.)
  • Segment 4: A very small segment of relatively heavy apps that are, on average, the best rated, the most downloaded, and the most reviewed. (Mainly games and communication apps.) All of its apps are free.
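
K-means cluster numbers are arbitrary and can change between runs when no `random_state` is set. The centers are also expressed in standardized units, which makes them hard to read as ratings, MB, or dollars. A minimal sketch of one way to address both: keep the fitted scaler, fix the seed, and map the centers back with `inverse_transform`. The `toy` data, the seed, and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical toy data with a few very expensive apps.
toy = pd.DataFrame({
    "Rating": rng.uniform(1, 5, 200),
    "Size": rng.uniform(1, 100, 200),
    "Price": rng.choice([0.0, 0.99, 399.99], 200, p=[0.9, 0.08, 0.02]),
})

# Keep the scaler object so the transform can be undone later.
scaler = StandardScaler()
X_std = scaler.fit_transform(toy)

# A fixed random_state keeps the cluster numbering stable across runs.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Centers in original units, plus the number of apps in each cluster.
centers_original = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_), columns=toy.columns
)
centers_original["n_apps"] = np.bincount(model.labels_, minlength=3)
print(centers_original)
```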
In [104]:
df_fix['clusters']=model.labels_
df_fix[df_fix['clusters']==2]  # do you know these apps? Probably yes!
Out[104]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres ... SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS WEATHER Everyone Everyone 10+ Mature 17+ Teen clusters
50 Real Tractor Farming AUTO_AND_VEHICLES 4.0 1598 56.0 1000000 Free 0.0 Everyone Auto & Vehicles ... 0 0 0 0 0 1 0 0 0 2
51 Ultimate F1 Racing Championship AUTO_AND_VEHICLES 3.8 284 57.0 100000 Free 0.0 Everyone Auto & Vehicles ... 0 0 0 0 0 1 0 0 0 2
57 Extreme Rally Championship AUTO_AND_VEHICLES 4.2 129 54.0 100000 Free 0.0 Everyone Auto & Vehicles ... 0 0 0 0 0 1 0 0 0 2
103 Beauty Selfie Camera BEAUTY 4.2 2225 52.0 500000 Free 0.0 Everyone Beauty ... 0 0 0 0 0 1 0 0 0 2
122 Sephora: Skin Care, Beauty Makeup & Fragrance ... BEAUTY 4.5 26834 57.0 1000000 Free 0.0 Everyone Beauty ... 0 0 0 0 0 1 0 0 0 2
169 English Persian Dictionary BOOKS_AND_REFERENCE 4.5 26875 73.0 500000 Free 0.0 Everyone Books & Reference ... 0 0 0 0 0 1 0 0 0 2
194 OfficeSuite : Free Office + PDF Editor BUSINESS 4.3 1002861 35.0 100000000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
205 Polaris Office for LG BUSINESS 4.2 30847 55.0 5000000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
242 Insightly CRM BUSINESS 3.8 1383 51.0 100000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
243 QuickBooks Accounting: Invoicing & Expenses BUSINESS 4.3 23175 41.0 1000000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
247 Crew - Free Messaging and Scheduling BUSINESS 4.6 4159 48.0 500000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
258 Cisco Webex Teams BUSINESS 4.2 1661 46.0 100000 Free 0.0 Everyone Business ... 0 0 0 0 0 1 0 0 0 2
345 Yahoo Mail – Stay Organized COMMUNICATION 4.3 4187998 16.0 100000000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
379 My Vodacom SA COMMUNICATION 3.7 25021 61.0 5000000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
380 Microsoft Edge COMMUNICATION 4.3 27187 66.0 5000000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
395 GO SMS Pro - Messenger, Free Themes, Emoji COMMUNICATION 4.4 2876500 24.0 100000000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
455 Email TypeApp - Mail App COMMUNICATION 4.6 183374 44.0 1000000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
485 Hily: Dating, Chat, Match, Meet & Hook up DATING 4.1 2556 56.0 100000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
490 CMB Free Dating App DATING 4.0 48845 40.0 1000000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
497 Mingle2 - Free Online Dating & Singles Chat Rooms DATING 4.3 37053 44.0 1000000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
516 Sudy – Meet Elite & Rich Single DATING 4.1 17268 40.0 500000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
534 Gay Sugar Daddy Dating & Hookup – Sudy Gay DATING 4.1 2212 41.0 100000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
569 SweetRing - Meet, Match, Date DATING 4.0 51698 63.0 1000000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
570 BiggerCity: Chat for gay bears, chubs & chasers DATING 4.1 923 44.0 100000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
581 JustDating DATING 4.0 13440 49.0 500000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
600 iPair-Meet, Chat, Dating DATING 4.5 182986 77.0 5000000 Free 0.0 Mature 17+ Dating ... 0 0 0 0 0 0 0 1 0 2
715 Dinosaurs Coloring Pages EDUCATION 4.4 390 41.0 500000 Free 0.0 Everyone Education;Education ... 0 0 0 0 0 1 0 0 0 2
716 Cars Coloring Pages EDUCATION 4.4 1090 49.0 1000000 Free 0.0 Everyone Education;Creativity ... 0 0 0 0 0 1 0 0 0 2
719 Monster Truck Driver & Racing EDUCATION 4.4 748 51.0 1000000 Free 0.0 Everyone Education;Action & Adventure ... 0 0 0 0 0 1 0 0 0 2
728 Free intellectual training game application | EDUCATION 4.2 5741 84.0 1000000 Free 0.0 Everyone Education;Pretend Play ... 0 0 0 0 0 1 0 0 0 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10426 Strike! Ten Pin Bowling SPORTS 4.2 18584 49.0 5000000 Free 0.0 Everyone Sports ... 1 0 0 0 0 1 0 0 0 2
10429 Talking Tom Bubble Shooter FAMILY 4.4 687136 54.0 50000000 Free 0.0 Everyone Casual ... 0 0 0 0 0 1 0 0 0 2
10431 Forgotten Hill Mementoes GAME 4.6 2027 56.0 100000 Free 0.0 Teen Adventure ... 0 0 0 0 0 0 0 0 1 2
10469 TownWiFi | Wi-Fi Everywhere COMMUNICATION 3.9 2372 58.0 500000 Free 0.0 Everyone Communication ... 0 0 0 0 0 1 0 0 0 2
10480 FJ 4x4 Cruiser Offroad Driving FAMILY 4.1 3543 49.0 500000 Free 0.0 Everyone Simulation ... 0 0 0 0 0 1 0 0 0 2
10481 FJ 4x4 Cruiser Snow Driving FAMILY 4.2 1619 43.0 500000 Free 0.0 Everyone Simulation ... 0 0 0 0 0 1 0 0 0 2
10503 Offroad 4x4 Car Driving FAMILY 4.3 26224 43.0 1000000 Free 0.0 Everyone Simulation ... 0 0 0 0 0 1 0 0 0 2
10504 Motocross Beach Jumping 3D FAMILY 4.0 105954 43.0 10000000 Free 0.0 Teen Simulation ... 0 0 0 0 0 0 0 0 1 2
10507 Rope Hero: Vice Town GAME 4.4 452589 99.0 10000000 Free 0.0 Mature 17+ Action ... 0 0 0 0 0 0 0 1 0 2
10508 Drive 4x4 Luxury SUV Jeep GAME 4.2 2183 46.0 500000 Free 0.0 Everyone Racing ... 0 0 0 0 0 1 0 0 0 2
10520 Driving Suv Toyota Car Simulator FAMILY 3.7 187 54.0 10000 Free 0.0 Everyone Simulation ... 0 0 0 0 0 1 0 0 0 2
10521 Navy Gunner Shoot War 3D GAME 4.0 103199 44.0 10000000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10526 Fairy Kingdom: World of Magic and Farming FAMILY 4.4 129542 63.0 1000000 Free 0.0 Everyone Strategy;Creativity ... 0 0 0 0 0 1 0 0 0 2
10673 Magnum 3.0 Gun Custom SImulator GAME 4.5 16815 59.0 1000000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10686 Armed Cam Gun Pack GAME 4.2 1012 50.0 10000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10687 Zombie Defense FAMILY 4.3 275048 41.0 10000000 Free 0.0 Teen Strategy ... 0 0 0 0 0 0 0 0 1 2
10711 4x4 Jeep Racer GAME 4.1 7279 54.0 1000000 Free 0.0 Everyone Racing ... 0 0 0 0 0 1 0 0 0 2
10717 Frontline Terrorist Battle Shoot: Free FPS Sho... GAME 4.2 9183 49.0 1000000 Free 0.0 Mature 17+ Action ... 0 0 0 0 0 0 0 1 0 2
10723 Mobile Kick SPORTS 4.3 111809 40.0 10000000 Free 0.0 Everyone Sports ... 1 0 0 0 0 1 0 0 0 2
10731 FeaturePoints: Free Gift Cards FAMILY 3.9 121321 46.0 5000000 Free 0.0 Everyone Entertainment ... 0 0 0 0 0 1 0 0 0 2
10770 Modern Counter Terrorist FPS Shoot GAME 4.0 795 41.0 100000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10779 Fortune Quest: Savior FAMILY 3.6 135 75.0 10000 Free 0.0 Everyone 10+ Role Playing ... 0 0 0 0 0 0 1 0 0 2
10781 Modern Strike Online GAME 4.3 834117 44.0 10000000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10783 Modern Counter Terror Attack – Shooting Game GAME 4.2 340 72.0 50000 Free 0.0 Mature 17+ Action ... 0 0 0 0 0 0 0 1 0 2
10784 Big Hunter GAME 4.3 245455 84.0 10000000 Free 0.0 Everyone 10+ Action ... 0 0 0 0 0 0 1 0 0 2
10787 Modern Counter Global Strike 3D GAME 4.1 297 48.0 50000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2
10789 Modern Counter Global Strike 3D V2 GAME 4.0 368 48.0 50000 Free 0.0 Everyone 10+ Action ... 0 0 0 0 0 0 1 0 0 2
10793 Sid Story GAME 4.4 28510 78.0 500000 Free 0.0 Teen Card ... 0 0 0 0 0 0 0 0 1 2
10797 Fuel Rewards® program LIFESTYLE 4.6 32433 46.0 1000000 Free 0.0 Everyone Lifestyle ... 0 0 0 0 0 1 0 0 0 2
10803 Fatal Raid - No.1 Mobile FPS GAME 4.3 56496 81.0 1000000 Free 0.0 Teen Action ... 0 0 0 0 0 0 0 0 1 2

1075 rows × 53 columns

In [105]:
df_fix.clusters.value_counts()
Out[105]:
0    3079
2    1075
3     989
4      12
1      11
Name: clusters, dtype: int64
In [106]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="clusters", y="Rating", data=df_fix, jitter=True, linewidth=1)
In [107]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="clusters", y="Price", data=df_fix, jitter=True, linewidth=1)
In [108]:
a=sns.pairplot(df_fix,vars=['Rating', 'reviews', 'Size', 'installs'], hue='clusters')

Let's settle the question of how much a game app weighs

Every app has a different size, but the plots give the general impression that game apps weigh more than the rest. Is that true?

In [109]:
df_fix[df_fix['Category']=='GAME'].Size.describe()
Out[109]:
count    751.000000
mean      44.941623
std       27.063867
min        0.116000
25%       23.000000
50%       41.000000
75%       63.000000
max      100.000000
Name: Size, dtype: float64
In [110]:
df_fix[df_fix['Category']!='GAME'].Size.describe()
Out[110]:
count    4415.000000
mean       21.123721
std        21.398764
min         0.008500
25%         5.500000
50%        13.000000
75%        29.000000
max       100.000000
Name: Size, dtype: float64
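
The two summaries point the same way: games have a mean of about 44.9 MB and a median of 41 MB, against 21.1 MB and 13 MB for the rest. A one-sided Mann-Whitney U test is one way to check that the difference is not due to chance without assuming normal sizes. The sketch below runs it on synthetic sizes; the two arrays are made-up stand-ins for the `Size` values of game and non-game apps.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sizes in MB; in the notebook these would be the Size column
# of df_fix split by Category == 'GAME'.
game_sizes = rng.uniform(20, 100, 200)
other_sizes = rng.uniform(1, 60, 800)

# H1: sizes of game apps tend to be larger than sizes of other apps.
u_stat, p_value = stats.mannwhitneyu(game_sizes, other_sizes,
                                     alternative="greater")
print(f"U={u_stat:.0f}, p={p_value:.2e}")
print(np.median(game_sizes), np.median(other_sizes))
```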
In [111]:
df_fix.columns
Out[111]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Android Ver',
       'installs', 'reviews', 'rating', 'ART_AND_DESIGN', 'AUTO_AND_VEHICLES',
       'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FAMILY', 'FINANCE',
       'FOOD_AND_DRINK', 'GAME', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'MAPS_AND_NAVIGATION', 'MEDICAL',
       'NEWS_AND_MAGAZINES', 'PARENTING', 'PERSONALIZATION', 'PHOTOGRAPHY',
       'PRODUCTIVITY', 'SHOPPING', 'SOCIAL', 'SPORTS', 'TOOLS',
       'TRAVEL_AND_LOCAL', 'VIDEO_PLAYERS', 'WEATHER', 'Everyone',
       'Everyone 10+', 'Mature 17+', 'Teen', 'clusters'],
      dtype='object')
In [112]:
b=df_fix.dtypes
In [113]:
df_fix[df_fix['clusters']==0].Category.value_counts()
Out[113]:
FAMILY                 508
TOOLS                  287
GAME                   274
PERSONALIZATION        162
PRODUCTIVITY           120
FINANCE                116
PHOTOGRAPHY            115
SPORTS                 114
LIFESTYLE               99
SHOPPING                97
HEALTH_AND_FITNESS      94
COMMUNICATION           91
SOCIAL                  82
MEDICAL                 82
BOOKS_AND_REFERENCE     79
NEWS_AND_MAGAZINES      79
BUSINESS                78
EDUCATION               71
TRAVEL_AND_LOCAL        53
VIDEO_PLAYERS           49
DATING                  47
ART_AND_DESIGN          44
FOOD_AND_DRINK          42
ENTERTAINMENT           40
MAPS_AND_NAVIGATION     38
WEATHER                 37
AUTO_AND_VEHICLES       34
HOUSE_AND_HOME          34
LIBRARIES_AND_DEMO      27
COMICS                  25
PARENTING               23
EVENTS                  20
BEAUTY                  18
Name: Category, dtype: int64
In [114]:
df_fix[df_fix['clusters']==1].Category.value_counts()
Out[114]:
FINANCE      4
LIFESTYLE    4
FAMILY       3
Name: Category, dtype: int64
In [115]:
df_fix[df_fix['clusters']==2].Category.value_counts()
Out[115]:
GAME                   384
FAMILY                 375
SPORTS                  43
HEALTH_AND_FITNESS      30
FINANCE                 23
PHOTOGRAPHY             21
SOCIAL                  21
TRAVEL_AND_LOCAL        19
LIFESTYLE               16
VIDEO_PLAYERS           12
ENTERTAINMENT           11
MEDICAL                 10
EDUCATION               10
COMMUNICATION           10
MAPS_AND_NAVIGATION      9
DATING                   9
PRODUCTIVITY             9
SHOPPING                 8
BUSINESS                 8
TOOLS                    7
PERSONALIZATION          7
BOOKS_AND_REFERENCE      6
PARENTING                6
FOOD_AND_DRINK           5
AUTO_AND_VEHICLES        4
BEAUTY                   3
HOUSE_AND_HOME           3
LIBRARIES_AND_DEMO       3
WEATHER                  2
EVENTS                   1
Name: Category, dtype: int64
In [116]:
df_fix[df_fix['clusters']==3].Category.value_counts()
Out[116]:
FAMILY                 213
TOOLS                  134
GAME                    87
LIFESTYLE               55
FINANCE                 40
DATING                  33
PHOTOGRAPHY             30
COMMUNICATION           29
SPORTS                  27
BUSINESS                27
TRAVEL_AND_LOCAL        26
MEDICAL                 25
PRODUCTIVITY            25
NEWS_AND_MAGAZINES      24
HEALTH_AND_FITNESS      23
VIDEO_PLAYERS           22
MAPS_AND_NAVIGATION     19
PERSONALIZATION         19
SHOPPING                17
COMICS                  14
FOOD_AND_DRINK          14
ENTERTAINMENT           13
SOCIAL                  12
HOUSE_AND_HOME          10
BOOKS_AND_REFERENCE      9
WEATHER                  8
AUTO_AND_VEHICLES        8
LIBRARIES_AND_DEMO       8
ART_AND_DESIGN           6
EDUCATION                4
PARENTING                4
BEAUTY                   3
EVENTS                   1
Name: Category, dtype: int64
In [117]:
df_fix[df_fix['clusters']==4].Category.value_counts()
Out[117]:
GAME                  6
COMMUNICATION         2
HEALTH_AND_FITNESS    1
NEWS_AND_MAGAZINES    1
TOOLS                 1
PRODUCTIVITY          1
Name: Category, dtype: int64

Sentiment Analysis

In [118]:
df1=pd.read_csv('./googleplaystore_user_reviews.csv')
In [119]:
df1.head()
Out[119]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You I like eat delicious food. That's I'm cooking ... Positive 1.00 0.533333
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462
2 10 Best Foods for You NaN NaN NaN NaN
3 10 Best Foods for You Works great especially going grocery store Positive 0.40 0.875000
4 10 Best Foods for You Best idea us Positive 1.00 0.300000
In [120]:
df=pd.merge(df1,df_fix, on="App", how="inner")
In [121]:
df.head(4)
Out[121]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs ... SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS WEATHER Everyone Everyone 10+ Mature 17+ Teen clusters
0 10 Best Foods for You I like eat delicious food. That's I'm cooking ... Positive 1.00 0.533333 HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462 HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0
2 10 Best Foods for You NaN NaN NaN NaN HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0
3 10 Best Foods for You Works great especially going grocery store Positive 0.40 0.875000 HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0

4 rows × 57 columns
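
An inner merge on `App` keeps only apps present in both tables and repeats the app-level columns on every review row. If the app table contained repeated app names, each review would also be duplicated once per repeat. The sketch below shows that effect and one possible guard, `validate="many_to_one"`, on two tiny made-up frames. Whether `df_fix` actually contains repeated names is not shown in this notebook, so this is an assumption to check.

```python
import pandas as pd

# Hypothetical review rows (many per app) and app rows (B appears twice).
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Sentiment_Polarity": [1.0, 0.25, 0.4, None],
})
apps = pd.DataFrame({
    "App": ["A", "B", "B"],
    "Rating": [4.0, 4.5, 4.5],
})

# Naive merge: the single review of B is matched twice.
naive = pd.merge(reviews, apps, on="App", how="inner")

# Deduplicate first, then let pandas verify that the right-hand keys are unique.
apps_unique = apps.drop_duplicates(subset="App")
merged = pd.merge(reviews, apps_unique, on="App", how="inner",
                  validate="many_to_one")

print(len(naive), len(merged))
```

App C has reviews but no app-level row, so the inner merge drops it in both cases.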

In [122]:
grouped_sentiment_app_count = df.groupby(['App', 'Rating','reviews','Type','installs','Size']).agg({'Sentiment_Polarity': 'mean','Sentiment_Subjectivity': 'mean'}).reset_index()
grouped_sentiment_app_count.sort_values(by=['Rating','Sentiment_Subjectivity'], ascending=False)
Out[122]:
App Rating reviews Type installs Size Sentiment_Polarity Sentiment_Subjectivity
365 DMV Permit Practice Test 2018 Edition 4.9 8.714403 Free 11.512925 27.000 0.295660 0.559806
400 Down Dog: Great Yoga Anywhere 4.9 10.273153 Free 13.122363 12.000 0.291847 0.526837
222 CDL Practice Test 2018 Edition 4.9 8.958540 Free 11.512925 17.000 0.241126 0.477825
483 FREE LIVE TALK 4.9 6.654153 Free 8.517193 4.900 NaN NaN
587 GPS Speedometer and Odometer 4.8 9.671871 Free 13.815511 3.300 0.687500 0.654167
565 FreePrints – Free Photos Delivered 4.8 11.603680 Free 13.815511 37.000 0.446181 0.626680
524 Find a Way: Addictive Puzzle 4.8 10.583549 Free 13.122363 14.000 0.069630 0.585089
580 Fuzzy Seasons: Animal Forest 4.8 9.404014 Free 11.512925 63.000 0.168933 0.579257
236 Calculator with Percent (Free) 4.8 10.783342 Free 13.815511 7.400 0.031973 0.563861
702 Home Workout - No Equipment 4.8 12.967243 Free 16.118096 15.000 0.338352 0.555221
78 Amino: Communities and Chats 4.8 14.045888 Free 16.118096 62.000 0.047590 0.537536
369 DU Recorder – Screen Recorder, Video Editor, Live 4.8 14.766678 Free 17.727534 9.700 0.224187 0.524684
703 Home Workout for Men - Bodybuilding 4.8 9.449751 Free 13.815511 15.000 0.523450 0.470104
465 Even - organize your money, get paid early 4.8 9.417680 Free 11.512925 21.000 0.283929 0.462738
628 GoodRx Drug Prices and Coupons 4.8 10.987967 Free 13.815511 11.000 0.252471 0.449028
269 Cash, Inc. Money Clicker Game & Business Adven... 4.8 13.217164 Free 16.118096 85.000 NaN NaN
306 Classical music for baby 4.8 7.570443 Free 11.512925 38.000 NaN NaN
326 CompTIA Exam Training 4.8 8.023880 Free 10.819778 17.000 NaN NaN
453 English Grammar Test 4.8 8.312626 Free 13.122363 5.100 NaN NaN
550 Free Books - Spirit Fanfiction and Stories 4.8 11.665707 Free 13.815511 5.000 NaN NaN
609 Girls Live Chat - Free Text & Video Chat 4.8 4.700480 Free 9.210340 4.900 NaN NaN
132 Backgrounds (HD Wallpapers) 4.7 12.218367 Free 16.118096 3.000 0.264432 0.599071
188 Blood Pressure Log - MyDiary 4.7 9.029777 Free 13.122363 2.600 0.325055 0.591402
2 1800 Contacts - Lens Store 4.7 10.050182 Free 13.815511 26.000 0.318145 0.591098
488 Face Filter, Selfie Editor - Sweet Camera 4.7 11.868037 Free 16.118096 22.000 0.194281 0.588866
35 ASOS 4.7 12.110651 Free 16.118096 22.000 0.316917 0.586405
319 Colorfit - Drawing & Coloring 4.7 9.916404 Free 13.122363 25.000 0.171836 0.572762
144 Baritastic - Bariatric Tracker 4.7 8.370548 Free 11.512925 12.000 0.418277 0.571031
103 Associated Credit Union Mobile 4.7 8.098643 Free 10.819778 12.000 0.388093 0.559535
364 DIY On A Budget 4.7 4.736198 Free 9.210340 8.300 0.448063 0.557628
... ... ... ... ... ... ... ... ...
108 Aviary Stickers: Free Pack 3.5 11.750855 Free 16.118096 0.624 0.065829 0.672514
109 Azpen eReader 3.5 5.049856 Free 13.122363 42.000 0.242624 0.597326
113 BBWCupid - BBW Dating App 3.5 5.484797 Free 10.819778 2.800 0.126912 0.471693
172 BioLife Plasma Services 3.5 5.521461 Free 11.512925 23.000 0.154806 0.457704
229 CVS Caremark 3.5 8.217978 Free 13.122363 10.000 0.191908 0.429123
496 Fantasy Football 3.5 10.823352 Free 13.815511 23.000 0.062581 0.411021
435 EasyBib: Citation Generator 3.5 7.247793 Free 11.512925 7.300 0.071832 0.391224
662 HTC Speak 3.5 8.723394 Free 16.118096 13.000 0.004106 0.334344
19 A Manual of Acupuncture 3.5 5.365976 Paid 6.907755 68.000 NaN NaN
615 GlassesOff 3.5 7.160846 Free 11.512925 38.000 NaN NaN
150 BeWild Free Dating & Chat App 3.4 7.600402 Free 11.512925 8.000 0.170372 0.518070
145 Baseball Boy! 3.4 11.906163 Free 16.118096 78.000 0.012272 0.474480
555 Free Foreclosure Real Estate Search by USHUD.com 3.4 5.659482 Free 11.512925 27.000 -0.038367 0.440234
407 Draw A Stickman 3.4 10.284148 Free 13.815511 17.000 -0.150000 0.400000
318 ColorSnap® Visualizer 3.4 8.095599 Free 13.815511 77.000 0.024391 0.349993
549 Free Book Reader 3.4 7.426549 Free 11.512925 4.000 NaN NaN
554 Free Dating Hook Up Messenger 3.3 7.053586 Free 11.512925 21.000 0.147288 0.522336
48 Adult Dating - AdultFinder 3.3 7.378384 Free 13.122363 3.800 NaN NaN
346 Create A Superhero HD 3.3 9.064389 Free 13.122363 19.000 NaN NaN
658 HSL - Tickets, route planner and information 3.3 5.545177 Free 11.512925 17.000 NaN NaN
458 Entel 3.2 9.690789 Free 13.815511 55.000 0.018577 0.331314
371 Daily Manga - Comic & Webtoon 3.2 7.276556 Free 11.512925 7.100 NaN NaN
329 ConnectLine 3.1 5.533389 Free 10.819778 4.200 0.138012 0.450320
474 EyeCloud 3.1 7.144407 Free 11.512925 55.000 0.117179 0.355441
70 Allegiant 3.1 8.625509 Free 13.815511 21.000 0.233333 0.233333
43 Acorn TV: World-class TV from Britain and Beyond 3.0 6.200509 Free 10.819778 23.000 0.086058 0.507038
501 FarmersOnly Dating 3.0 7.044033 Free 11.512925 1.400 NaN NaN
91 Anthem Anywhere 2.7 7.884953 Free 13.122363 24.000 -0.087977 0.385036
92 Anthem BC Anywhere 2.6 6.206576 Free 11.512925 24.000 -0.123233 0.367896
444 EliteSingles – Dating for Single Professionals 2.5 8.589886 Free 13.122363 19.000 NaN NaN

716 rows × 8 columns

In [123]:
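# Correlate the numeric columns only; 'App' and 'Type' hold text and cannot be correlated
corr = grouped_sentiment_app_count.select_dtypes(include='number').corr()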
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(10, 200, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
In [124]:
a=sns.pairplot(grouped_sentiment_app_count,vars=['Rating', 'Sentiment_Polarity', 'Sentiment_Subjectivity', 'reviews','installs','Size'], hue='Type')
C:\Users\rfuen\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:448: RuntimeWarning: invalid value encountered in greater
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.
C:\Users\rfuen\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:448: RuntimeWarning: invalid value encountered in less
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.
In [125]:
#grouped_sentiment_category_sum = df.groupby(['Category']).agg({'Sentiment': 'count'}).reset_index()
#grouped_sentiment_category_sum
#df_sentiment=pd.merge(grouped_sentiment_category_sum,grouped_sentiment_category_count, how='inner', on='Category')
#df_sentiment
In [126]:
grouped_sentiment_category_count = df.groupby(['Category', 'Sentiment']).agg({'App': 'count'}).reset_index()
grouped_sentiment_category_count.head(5)
Out[126]:
Category Sentiment App
0 ART_AND_DESIGN Negative 58
1 ART_AND_DESIGN Neutral 54
2 ART_AND_DESIGN Positive 233
3 AUTO_AND_VEHICLES Negative 11
4 AUTO_AND_VEHICLES Neutral 20
In [127]:
# One frame per sentiment label, each starting with the categories that carry that label
df1=pd.DataFrame()
df2=pd.DataFrame()
df3=pd.DataFrame()
df1['Category']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Negative'].Category
df2['Category']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Positive'].Category
df3['Category']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Neutral'].Category
df2.head()
Out[127]:
Category
2 ART_AND_DESIGN
5 AUTO_AND_VEHICLES
8 BEAUTY
11 BOOKS_AND_REFERENCE
14 BUSINESS
In [128]:
# Attach the review count for each sentiment label
df1['Negative']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Negative'].App
df2['Positive']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Positive'].App
df3['Neutral']=grouped_sentiment_category_count[grouped_sentiment_category_count['Sentiment']=='Neutral'].App
# Join the three frames on Category and compute each label's share of the total
df4=pd.merge(df1,df2, how='inner', on='Category')
df5=pd.merge(df3,df4, how='inner', on='Category')
df5['Total']=df5['Neutral']+df5['Negative']+df5['Positive']
df5['positive']=df5['Positive']/df5['Total']
df5['neutral']=df5['Neutral']/df5['Total']
df5['negative']=df5['Negative']/df5['Total']
df5['total']=1
# Ascending sort: categories with the lowest positive share come first
df5=df5.sort_values(by=['positive','neutral'])
df5.head(2)
Out[128]:
Category Neutral Negative Positive Total positive neutral negative total
2 BEAUTY 83 57 162 302 0.536424 0.274834 0.188742 1
8 ENTERTAINMENT 137 140 377 654 0.576453 0.209480 0.214067 1
In [129]:
df5.tail(2)
Out[129]:
Category Neutral Negative Positive Total positive neutral negative total
14 HEALTH_AND_FITNESS 169 150 1263 1582 0.798357 0.106827 0.094817 1
1 AUTO_AND_VEHICLES 20 11 133 164 0.810976 0.121951 0.067073 1
In [130]:
df6=df5.sort_values(by=['negative'], ascending=False)
df6.head(2)
Out[130]:
Category Neutral Negative Positive Total positive neutral negative total
13 GAME 220 1843 2966 5029 0.589779 0.043746 0.366474 1
10 FAMILY 178 492 1062 1732 0.613164 0.102771 0.284065 1
In [131]:
df7=df5.sort_values(by=['neutral'], ascending=False)
df7
Out[131]:
Category Neutral Negative Positive Total positive neutral negative total
2 BEAUTY 83 57 162 302 0.536424 0.274834 0.188742 1
28 TOOLS 179 112 493 784 0.628827 0.228316 0.142857 1
21 PARENTING 41 19 124 184 0.673913 0.222826 0.103261 1
8 ENTERTAINMENT 137 140 377 654 0.576453 0.209480 0.214067 1
17 LIFESTYLE 182 154 561 897 0.625418 0.202899 0.171683 1
4 BUSINESS 131 111 413 655 0.630534 0.200000 0.169466 1
15 HOUSE_AND_HOME 82 81 248 411 0.603406 0.199513 0.197080 1
30 VIDEO_PLAYERS 38 37 123 198 0.621212 0.191919 0.186869 1
3 BOOKS_AND_REFERENCE 52 23 200 275 0.727273 0.189091 0.083636 1
5 COMMUNICATION 74 59 261 394 0.662437 0.187817 0.149746 1
25 SHOPPING 102 132 350 584 0.599315 0.174658 0.226027 1
29 TRAVEL_AND_LOCAL 136 177 475 788 0.602792 0.172589 0.224619 1
6 DATING 249 291 960 1500 0.640000 0.166000 0.194000 1
24 PRODUCTIVITY 70 77 280 427 0.655738 0.163934 0.180328 1
23 PHOTOGRAPHY 113 140 438 691 0.633864 0.163531 0.202605 1
12 FOOD_AND_DRINK 62 49 272 383 0.710183 0.161880 0.127937 1
20 NEWS_AND_MAGAZINES 104 168 371 643 0.576983 0.161742 0.261275 1
18 MAPS_AND_NAVIGATION 25 25 106 156 0.679487 0.160256 0.160256 1
0 ART_AND_DESIGN 54 58 233 345 0.675362 0.156522 0.168116 1
31 WEATHER 17 9 83 109 0.761468 0.155963 0.082569 1
19 MEDICAL 179 186 788 1153 0.683435 0.155247 0.161318 1
22 PERSONALIZATION 125 120 572 817 0.700122 0.152999 0.146879 1
27 SPORTS 131 229 515 875 0.588571 0.149714 0.261714 1
7 EDUCATION 69 56 342 467 0.732334 0.147752 0.119914 1
11 FINANCE 153 236 677 1066 0.635084 0.143527 0.221388 1
16 LIBRARIES_AND_DEMO 44 51 238 333 0.714715 0.132132 0.153153 1
9 EVENTS 15 16 87 118 0.737288 0.127119 0.135593 1
26 SOCIAL 38 77 195 310 0.629032 0.122581 0.248387 1
1 AUTO_AND_VEHICLES 20 11 133 164 0.810976 0.121951 0.067073 1
14 HEALTH_AND_FITNESS 169 150 1263 1582 0.798357 0.106827 0.094817 1
10 FAMILY 178 492 1062 1732 0.613164 0.102771 0.284065 1
13 GAME 220 1843 2966 5029 0.589779 0.043746 0.366474 1

BEAUTY is the category with the lowest share of positive written reviews (53.6%), which makes it the least loved by its users. It also has the highest share of neutral reviews (27.5%), which suggests that its apps do not evoke strong feelings. GAME, on the other hand, is the category with the highest share of negative reviews (36.6%) and the lowest share of neutral ones (4.4%): its apps seem to be either loved or hated, with no middle ground. AUTO_AND_VEHICLES (81.1% positive) and HEALTH_AND_FITNESS (79.8% positive) are the most loved categories.
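
The shares behind this comparison can also be obtained in one step by pivoting the count table instead of building and merging three separate frames. The following is only a minimal sketch, assuming pandas is available; `counts` is a hypothetical stand-in for `grouped_sentiment_category_count` that holds just three categories copied from the tables above, and `wide`, `shares`, `least_positive`, `most_negative` and `most_positive` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical subset of the count table: three categories copied from the output above
counts = pd.DataFrame({
    "Category": ["BEAUTY"] * 3 + ["GAME"] * 3 + ["AUTO_AND_VEHICLES"] * 3,
    "Sentiment": ["Negative", "Neutral", "Positive"] * 3,
    "App": [57, 83, 162, 1843, 220, 2966, 11, 20, 133],
})

# One row per category, one column per sentiment label
wide = counts.pivot(index="Category", columns="Sentiment", values="App")

# Divide each row by its total so the three shares of a category sum to 1
shares = wide.div(wide.sum(axis=1), axis=0)

least_positive = shares["Positive"].idxmin()
most_negative = shares["Negative"].idxmax()
most_positive = shares["Positive"].idxmax()

print(shares.round(3))
print(least_positive, most_negative, most_positive)
```

With only these three categories the ranking matches the one described in the text; on the full table the same three lines would rank all categories at once.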

In [132]:
# Shares of positive, neutral and negative reviews for BEAUTY (taken from df5)
sizes=[0.536424,0.274834,0.188742]
fig1, ax1 = plt.subplots(figsize=(10, 15))

ax1.pie(sizes, autopct='%1.1f%%',
         startangle=90)

plt.title("Distribution of review sentiment in the BEAUTY category", size = 20)

ax1.legend(['positive','neutral','negative'],
          title="Review sentiment",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.show()
In [133]:
# Shares of positive, neutral and negative reviews for GAME (taken from df5)
sizes=[0.589779,0.043746, 0.366474]
fig1, ax1 = plt.subplots(figsize=(10, 15))

ax1.pie(sizes, autopct='%1.1f%%',
         startangle=90)

plt.title("Distribution of review sentiment in the GAME category", size = 20)

ax1.legend(['positive','neutral','negative'],
          title="Review sentiment",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.show()
In [134]:
# Shares of positive, neutral and negative reviews for AUTO_AND_VEHICLES (taken from df5)
sizes=[0.810976,0.121951,0.067073]
fig1, ax1 = plt.subplots(figsize=(10, 15))

ax1.pie(sizes, autopct='%1.1f%%',
         startangle=90)

plt.title("Distribution of review sentiment in the AUTO_AND_VEHICLES category", size = 20)

ax1.legend(['positive','neutral','negative'],
          title="Review sentiment",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.show()
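
The three pie-chart cells above differ only in the category name and the three shares, so they could be folded into one helper. This is a sketch rather than the notebook's own code: `plot_sentiment_pie` and `category_shares` are hypothetical names, the shares are copied by hand from the `df5` output, and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt

def plot_sentiment_pie(category, shares):
    """Draw one pie chart from a (positive, neutral, negative) tuple of shares."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(shares, autopct='%1.1f%%', startangle=90)
    ax.set_title(f"Distribution of review sentiment in the {category} category", size=20)
    ax.legend(['positive', 'neutral', 'negative'],
              title="Review sentiment",
              loc="center left",
              bbox_to_anchor=(1, 0, 0.5, 1))
    return fig, ax

# Shares copied from the df5 rows shown above, in the order positive, neutral, negative
category_shares = {
    "BEAUTY": (0.536424, 0.274834, 0.188742),
    "GAME": (0.589779, 0.043746, 0.366474),
    "AUTO_AND_VEHICLES": (0.810976, 0.121951, 0.067073),
}

# One figure per category; in a notebook each figure is rendered when the cell finishes
figures = {name: plot_sentiment_pie(name, shares)
           for name, shares in category_shares.items()}
```

Returning the figure and axes lets the caller save or restyle each chart afterwards.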
In [135]:
df.head(2)
Out[135]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs ... SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS WEATHER Everyone Everyone 10+ Mature 17+ Teen clusters
0 10 Best Foods for You I like eat delicious food. That's I'm cooking ... Positive 1.00 0.533333 HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462 HEALTH_AND_FITNESS 4.0 2490 3.8 500000 ... 0 0 0 0 0 0 1 0 0 0

2 rows × 57 columns

In [142]:
from wordcloud import WordCloud
wc = WordCloud(background_color="white", max_words=200, colormap="Set2")
In [148]:
from nltk.corpus import stopwords

# English stop words plus app-related terms that carry no sentiment
stop = stopwords.words('english')
stop = stop + ['app', 'APP' ,'ap', 'App', 'apps', 'application', 'browser', 'website', 'websites', 'chrome', 'click', 'web', 'ip', 'address',
            'files', 'android', 'browse', 'service', 'use', 'one', 'download', 'email', 'Launcher']

# Drop missing reviews first: str(NaN) would otherwise turn them into the literal word 'nan'
df.dropna(subset=['Translated_Review'], inplace=True)

df['Translated_Review'] = df['Translated_Review'].apply(lambda x: " ".join(x for x in str(x).split(' ') if x not in stop))

# Reviews of apps rated above 4.5
good = df.loc[df.Rating>4.5, 'Translated_Review']

# Join the review texts themselves; str(good) would only pass the truncated Series printout
wc.generate(' '.join(good))

# A single figure: a second plt.figure() call would leave this one empty
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
In [150]:
# Stop words were already removed from Translated_Review in the previous cell,
# so only the rating filter changes here. Reviews of apps rated below 4.0:
bad = df.loc[df.Rating<4.0, 'Translated_Review'].astype(str)

# Skip the literal 'nan' that str() produces for a missing review
bad = bad[bad != 'nan']

# Join the review texts themselves; str(bad) would only pass the truncated Series printout
wc.generate(' '.join(bad))

# A single figure of the same size as the one used for the well-rated apps
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
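
A word cloud is a picture of word frequencies, so it can help to look at the counts directly before drawing it. The sketch below uses only the standard library; `reviews` is a hypothetical stand-in for the `Translated_Review` column, `stop_words` is a tiny illustrative set rather than the NLTK list used above, and `word_frequencies` is an assumed helper name.

```python
from collections import Counter

# Hypothetical reviews standing in for df['Translated_Review']
reviews = [
    "great game love the graphics",
    "love this game great fun",
    "keeps crashing after the update",
]

# Tiny illustrative stop-word set; the notebook uses NLTK's English list plus custom terms
stop_words = {"the", "this", "after"}

def word_frequencies(texts, stop_words):
    """Count words across all texts, ignoring case and skipping stop words."""
    joined = " ".join(texts)  # join the texts themselves, not the printed container
    words = [w.lower() for w in joined.split() if w.lower() not in stop_words]
    return Counter(words)

freqs = word_frequencies(reviews, stop_words)
print(freqs.most_common(3))
```

A mapping like `freqs` can be handed to `WordCloud.generate_from_frequencies`, which skips the library's own tokenization and keeps the stop-word handling in one place.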